{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1> Repeatable splitting </h1>\n",
    "\n",
    "In this notebook, we will explore the impact of different ways of creating machine learning datasets.\n",
    "\n",
    "<p>\n",
    "\n",
    "Repeatability is important in machine learning. If you do the same thing now and 5 minutes from now and get different answers, then it makes experimentation difficult. In other words, you will find it difficult to gauge whether a change you made has resulted in an improvement or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --user google-cloud-bigquery==1.25.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Restart** the kernel before proceeding further (On the Notebook menu - Kernel - Restart Kernel).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import bigquery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3> Create a simple machine learning model </h3>\n",
    "\n",
    "The dataset that we will use is <a href=\"https://console.cloud.google.com/bigquery?project=bigquery-samples&p=bigquery-samples&d=airline_ontime_data&t=flights&page=table\">a BigQuery public dataset</a> of airline arrival data. Click on the link, and look at the column names. Switch to the Details tab to verify that the number of records is 70 million, and then switch to the Preview tab to look at a few rows.\n",
    "<p>\n",
    "We want to predict the arrival delay of an airline based on the departure delay. The model that we will use is a zero-bias linear model:\n",
    "$$ delay_{arrival} = \\alpha * delay_{departure} $$\n",
    "<p>\n",
    "To train the model is to estimate a good value for $\\alpha$. \n",
    "<p>\n",
    "One approach to estimate alpha is to use this formula:\n",
    "$$ \\alpha = \\frac{\\sum delay_{departure} delay_{arrival} }{  \\sum delay_{departure}^2 } $$\n",
    "Because we'd like to capture the idea that this relationship is different for flights from New York to Los Angeles vs. flights from Austin to Indianapolis (shorter flight, less busy airports), we'd compute a different $alpha$ for each airport-pair.  For simplicity, we'll do this model only for flights between Denver and Los Angeles."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Naive random split (not repeatable) </h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "compute_alpha = \"\"\"\n",
    "#standardSQL\n",
    "SELECT \n",
    "    SAFE_DIVIDE(\n",
    "    SUM(arrival_delay * departure_delay),\n",
    "    SUM(departure_delay * departure_delay)) AS alpha\n",
    "FROM\n",
    "(\n",
    "    SELECT\n",
    "        RAND() AS splitfield,\n",
    "        arrival_delay,\n",
    "        departure_delay\n",
    "    FROM\n",
    "        `bigquery-samples.airline_ontime_data.flights`\n",
    "    WHERE\n",
    "        departure_airport = 'DEN'\n",
    "        AND arrival_airport = 'LAX'\n",
    ")\n",
    "WHERE\n",
    "    splitfield < 0.8\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9759292430838074\n"
     ]
    }
   ],
   "source": [
    "results = bigquery.Client().query(compute_alpha).to_dataframe()\n",
    "alpha = results['alpha'][0]\n",
    "print(alpha)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3> What is wrong with calculating RMSE on the training and test data as follows? </h3>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dataset</th>\n",
       "      <th>rmse</th>\n",
       "      <th>num_flights</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>train</td>\n",
       "      <td>13.034724</td>\n",
       "      <td>63913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>eval</td>\n",
       "      <td>13.280670</td>\n",
       "      <td>15776</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  dataset       rmse  num_flights\n",
       "0   train  13.034724        63913\n",
       "1    eval  13.280670        15776"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compute_rmse = \"\"\"\n",
    "#standardSQL\n",
    "SELECT\n",
    "    dataset,\n",
    "    SQRT(\n",
    "        AVG(\n",
    "            (arrival_delay - ALPHA * departure_delay) *\n",
    "            (arrival_delay - ALPHA * departure_delay)\n",
    "        )\n",
    "    ) AS rmse,\n",
    "    COUNT(arrival_delay) AS num_flights\n",
    "FROM (\n",
    "    SELECT\n",
    "        IF (RAND() < 0.8, 'train', 'eval') AS dataset,\n",
    "        arrival_delay,\n",
    "        departure_delay\n",
    "    FROM\n",
    "        `bigquery-samples.airline_ontime_data.flights`\n",
    "    WHERE\n",
    "        departure_airport = 'DEN'\n",
    "        AND arrival_airport = 'LAX' )\n",
    "GROUP BY\n",
    "    dataset\n",
    "\"\"\"\n",
    "bigquery.Client().query(\n",
    "    compute_rmse.replace('ALPHA', str(alpha))).to_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hint:\n",
    "* Are you really getting the same training data in the compute_rmse query as in the compute_alpha query?\n",
    "* Do you get the same answers each time you rerun the compute_alpha and compute_rmse blocks?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3> How do we correctly train and evaluate? </h3>\n",
    "<br/>\n",
    "Here's the right way to compute the RMSE using the actual training and held-out (evaluation) data. Note how much harder this feels.\n",
    "\n",
    "Although the calculations are now correct, the experiment is still not repeatable.\n",
    "\n",
    "Try running it several times; do you get the same answer?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_and_eval_rand = \"\"\"\n",
    "#standardSQL\n",
    "WITH\n",
    "    alldata AS (\n",
    "        SELECT\n",
    "            IF (RAND() < 0.8, 'train', 'eval') AS dataset,\n",
    "            arrival_delay,\n",
    "            departure_delay\n",
    "        FROM\n",
    "            `bigquery-samples.airline_ontime_data.flights`\n",
    "        WHERE\n",
    "        departure_airport = 'DEN'\n",
    "        AND arrival_airport = 'LAX' ),\n",
    "    training AS (\n",
    "        SELECT\n",
    "            SAFE_DIVIDE(\n",
    "                SUM(arrival_delay * departure_delay),\n",
    "                SUM(departure_delay * departure_delay)) AS alpha\n",
    "        FROM\n",
    "            alldata\n",
    "        WHERE\n",
    "            dataset = 'train' )\n",
    "\n",
    "SELECT\n",
    "    MAX(alpha) AS alpha,\n",
    "    dataset,\n",
    "    SQRT(\n",
    "        AVG(\n",
    "            (arrival_delay - alpha * departure_delay) *\n",
    "            (arrival_delay - alpha * departure_delay)\n",
    "        )\n",
    "    ) AS rmse,\n",
    "    COUNT(arrival_delay) AS num_flights\n",
    "FROM\n",
    "    alldata,\n",
    "    training\n",
    "GROUP BY\n",
    "    dataset\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>alpha</th>\n",
       "      <th>dataset</th>\n",
       "      <th>rmse</th>\n",
       "      <th>num_flights</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.976553</td>\n",
       "      <td>train</td>\n",
       "      <td>13.105478</td>\n",
       "      <td>63682</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.976553</td>\n",
       "      <td>eval</td>\n",
       "      <td>12.997229</td>\n",
       "      <td>16007</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      alpha dataset       rmse  num_flights\n",
       "0  0.976553   train  13.105478        63682\n",
       "1  0.976553    eval  12.997229        16007"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bigquery.Client().query(train_and_eval_rand).to_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Using HASH of date to split the data </h2>\n",
    "\n",
    "Let's split by date and train."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9758039143620403\n"
     ]
    }
   ],
   "source": [
    "compute_alpha = \"\"\"\n",
    "#standardSQL\n",
    "SELECT \n",
    "    SAFE_DIVIDE(\n",
    "        SUM(arrival_delay * departure_delay),\n",
    "        SUM(departure_delay * departure_delay)) AS alpha\n",
    "FROM\n",
    "    `bigquery-samples.airline_ontime_data.flights`\n",
    "WHERE\n",
    "    departure_airport = 'DEN'\n",
    "    AND arrival_airport = 'LAX'\n",
    "    AND ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8\n",
    "\"\"\"\n",
    "results = bigquery.Client().query(compute_alpha).to_dataframe()\n",
    "alpha = results['alpha'][0]\n",
    "print(alpha)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use the alpha to compute RMSE. Because the alpha value is repeatable, we don't need to worry that the alpha in the compute_rmse will be different from the alpha computed in the compute_alpha."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  dataset       rmse  num_flights\n",
      "0    eval  12.764691        15671\n",
      "1   train  13.160713        64018\n"
     ]
    }
   ],
   "source": [
    "compute_rmse = \"\"\"\n",
    "#standardSQL\n",
    "SELECT\n",
    "    IF(ABS(MOD(FARM_FINGERPRINT(date), 10)) < 8, 'train', 'eval') AS dataset,\n",
    "    SQRT(\n",
    "        AVG(\n",
    "            (arrival_delay - ALPHA * departure_delay) *\n",
    "            (arrival_delay - ALPHA * departure_delay)\n",
    "        )\n",
    "    ) AS rmse,\n",
    "    COUNT(arrival_delay) AS num_flights\n",
    "FROM\n",
    "    `bigquery-samples.airline_ontime_data.flights`\n",
    "WHERE\n",
    "    departure_airport = 'DEN'\n",
    "    AND arrival_airport = 'LAX'\n",
    "GROUP BY\n",
    "    dataset\n",
    "\"\"\"\n",
    "print(bigquery.Client().query(\n",
    "    compute_rmse.replace('ALPHA', str(alpha))).to_dataframe().head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note also that the RMSE on the evaluation dataset more from the RMSE on the training dataset when we do the split correctly.  This should be expected; in the RAND() case, there was leakage between training and evaluation datasets, because there is high correlation between flights on the same day.\n",
    "<p>\n",
    "This is one of the biggest dangers with doing machine learning splits the wrong way -- <b> you will develop a false sense of confidence in how good your model is! </b>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2018 Google Inc.\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "http://www.apache.org/licenses/LICENSE-2.0\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
